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ABSTRACT 

This paper presents a deterministic linear time complexity 
IDE-SimRank method to approximately compute SimRank 
with proved error bound. SimRank is a well-known simi¬ 
larity measure between graph vertices which relies on graph 
topology only and is built on intuition that ’’two objects 
are similar if they are related to similar objects”. The fixed 
point equation for direct SimRank computation is the dis¬ 
crete Lyapunov equation with specific diagonal matrix in 
the right hand side. The proposed method is based on es¬ 
timation of this diagonal matrix with GMRES and use this 
estimation to compute singe-source and single pairs queries. 
These computations are executed with the part of series con¬ 
verging to the discrete Lyapunov equation solution. 

Keywords 
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1. INTRODUCTION 

This paper presents a new method to efficiently compute 
the SimRank [?] which is a topologically induced similarity 
measure between two given vertices of a graph. 

Similarity measures for graphs are useful in many appli¬ 
cations such as relation mining [?], document-by-document 
querying [?, ?] and many others. A major problem in Sim¬ 
Rank computation is the high storage and time complexity 
of the direct iterative process converging to SimRank. Sev¬ 
eral schemes have been presented for the approximate com¬ 
putation of the SimRank which are based on different con¬ 
cepts [?, ?, ?]. In this paper we propose a two-step method 
for the approximation of the SimRank. The hrst and the 
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most expensive step is the computation where we iteratively 
estimate the diagonal of the SimRank matrix. After that 
the SimRank scores can be computed by the approximate 
solution of the discrete Lyapunov equation. Such approach 
has been considered in [?]. The difference is that instead 
of using Gauss-Seidel method combined with Monte-Garlo 
computations to estimate the diagonal we use numerical lin¬ 
ear algebra techniques. We prove that the linear system for 
the diagonal has bounded condition number, and we have 
an 0(n) matrix-by-vector product with guaranteed accu¬ 
racy, thus inexact GMRES method is the method of choice. 
Using the theory of IGMRES we get a provable 0{n) com¬ 
plexity algorithm for the computation of SimRank scores. 
The final SimRank approximation is a sparse matrix. We 
also provide an efficient practical algorithm for the compu¬ 
tation of SimRank. 


2. PROBLEM STATEMENT 

Let G = (V, E) be a graph, where U is a set of vertices and 
i? is a set of edges. The order of the graph is the number of 
vertices \ V\ = n. Similarity measures between graph vertices 
are very useful in some applications. One of the approach 
to dehne such similarity measure is SimRank [?]. In the 
foundation of SimRank definition lies idea that ’’two objects 
are similar if they are referenced by the similar objects”. It 
is proposed that vertex similarity lies between 0 and 1 with 
vertex being maximally similar to itself with similarity 1. By 
s(a, b) denote the SimRank between vertices a and b dehned 
as 


s(a, b) 


1, if a = 6 

0, if I {a) = 0 or 7(6) = 0 


|r(a)||I(6)| 


\Ha)\\i{b)\ ^ 


^ ^ s(T(a),T(6)), 


otherwise. 


( 1 ) 

where I{v) is the set of in-neighbours of vertex v, constant 
c £ (0,1). Denote by S' a SimRank matrix, where (i, j)-th 
element is SimRank between i-th and y-th vertices. To find 
the SimRank matrix S, one writes s{a, b) for all pairs (a, 6) 
of vertices and obtains linear system of equations which 
has unique solution [?]. Let A be an adjacency matrix of 




the graph G normalized by columns: 


71 

^ ~ I5 J ~ 1) ■ ■ ■ 5 ^5 

i=l 


and 

W = ^cA. 

The SimRank matrix S is the solution of the following equa¬ 
tion: 

S SW-Aia.g{W^ SW)+I = W^ SW + D, (2) 

and D is the diagonal matrix D = — diag(lT^5'tT) -I- I. 
Equation © is typically solved by a fixed-point iteration 

So = I, 

T T 3 

Sk+1 = cA^SkA - cdiag(A^5'fe^) -f I, 

where diag(P) is an operator that maps given matrix P to 
the diagonal matrix with diagonal entries equal the diagonal 
entries of the matrix P. The iteration converges if c < 1. 
Direct usage of m requires 0{n^) memory cells and 0{n^) 
operations for one iteration, thus is infeasible for real-world 
graphs. In this paper we propose a new Iterative Diagonal 
Estimation method (IDE-SimRank method) that approxi¬ 
mately computes the SimRank in 0(n) time and memory. 

3. IDE-SIMRANK METHOD 

In this section we describe IDE-SimRank method and give 
theoretical foundation which proves the reasoning and accu¬ 
racy of our method. 

From m it is easy to get a linear system with n unknowns. 
Since diag S' = / by dehnition, and 

S = W^ SW + D, 

where D is a diagonal matrix, we get 

7 = diag(tT^SIT)-f D = P(D), (4) 

where F{D) is a linear operator that maps a diagonal matrix 
(i.e., a vector of length n) to a diagonal matrix (also a vector 
of length n), and is defined as 

F{D) =DP diag(IT^S(D)IT), (5) 

and S{D) is the solution of the discrete Lyapunov equation 

S{D) = S{D)W P D. 

So, to evaluate F{D) for a given D we have to solve the 
discrete Lyapunov equation and take only the diagonal of the 
solution. This can not be done exactly in 0(n) complexity, 
but it is possible to do approximate computation of F{D) 
with guaranteed accuracy. Moreover,the operator F{D) is 
well-conditioned, so it is natural to use iterative methods 
with inexact matrix-by-vector products to solve Inexact 

GMRES [?] is typically a method of choice. In order to get 
a working method, we need two components: 

1. Estimates for the condition number of the linear oper¬ 
ator F{D). 

2. Algorithm for the computation of F{D) with a given 
accuracy e. (and estimate of its complexity). 


Note that the equation (|4]) was used in the paper [?] un¬ 
der the name Linearized SimRank. The difference in our 
approach is that we use sparse matrix arithmetic and in¬ 
exact iterative method for its solution (compared to the 
Gauss-Seidel method combined with Monte-Carlo estima¬ 
tion to compute 5(D)). 

3.1 Estimation of condition number of f{D) 

Let vec(-) be an operator that maps an n x n matrix to 
a vector of length taking column-by-column. Denote by 
Pv = vec(D(u)) an operator that maps a vector v of length 
n to a vector of length n^, where D{v) is a diagonal matrix 
with V on the diagonal. Now by a slight abuse of notation 
let F and 5 act on a vector d of length n. Then, the matrix 
corresponding to the operator F(d) can be written using 
Kronecker products as 

D = / -k P^(W^ (g) IT'^)(/ - IT^ (g) W^)~^P 
= P^(I -W^ i»W^)~^P. 

The matrix P is the submatrix of the x identity matrix, 
thus D is a submatrix of the matrix (I — (g) W^)~^. 

The matrix F is the submatrix of the inverse M-matrix, 
thus it is also the inverse M-matrix (see [?]), i.e. it is non¬ 
singular. Moreover, its condition number can be bounded. 


Theorem 1. 


r(D)i < 


2(1 + c) 

(1 - c)2 • 


Proof. For simplicity, introduce the matrix 

Z = lT^(g)IT^. 

The matrix Z is nonnegative and \\Z\\i = c. Since D is a 
submatrix of the matrix {I — Z)~^, there exists an x 
permutation matrix Q such that 

Q{I-Z)Q^ = 

and 

Q{I-Z)-^Q^ = 

Using well-known formulas for block matrix inversion, the 
matrix F~^ can be written as the Schur complement 

F~^ = D-BA~^C, 

Consequently, the 1-norm of F~^ is bounded in the following 
way: 

< Pill + IIBIIillA-illillClli 

The matrices D, B, C are submatrices of the matrix I — Z, 
therefore their norms are bounded by ||/ —Z||i < (l-|-c) (the 
norms of the submatrices can not exceed the norm of the 
matrix). To estimate ||A“^||i note that A is also a principal 
submatrix of {I — Z), thus it can be represented as 

{i-z), 

where p||i < c, therefore using the standard Neumann 
series argument 



A B 
C D ’ 


* * 
* F 








Finally, 


It is obvious that 


5k < Vk, 


l|f’-'l|i<(l + c) + 


{1 + cf _ 2(1+ c) 


1 — c 1 — c 
The matrix F is the submatrix of {I — Z)~^, thus 


< 


1-c 


and this completes the proof. □ 

3.2 Fast approximate matrix-by-vector prod¬ 
uct 


where rjk solves 

rjk+i = (1 + c)rik + T, 


which can be solved as 


Vk=T + 


((l + c)'= 

c 


1 ) 


The final result is obtained by using a well-known estimate 
on the remainder of the Neumann series. □ 


The key component for the efficient solution of the sys¬ 
tem (|4]) is the fast evaluation of F{D) for a given D. The 
main computational cost comes from the solution of the dis¬ 
crete Lyapunov equation of the form 

S = W^SW + D, 

where D is a diagonal matrix. The solution can be written 
as 

oo K 

S = + Rk = 

fc=0 '■‘I 

= Sk -f Rk, 

where ||i?ir||i < c^. The truncated series Sk gives an ap¬ 
proximation to S{D) with guaranteed accuracy. 

Algorithm [T] presents fast approximate matvec algorithm 
used further in GMRES. In this algorithm the operator diagonal(-) 
maps given nx n matrix to its n x 1 diagonal. Note, that to 
get 0{n) complexity we have introduced thresholding: the 
elements smaller than r are zeroed out, and all computations 
are implement in sparse matrix arithmetic. 


Algorithm 1: Fast approximate matvec algorithm 
Data: Scaled adjacency nx n matrix W, given n x 1 
vector X, threshold r, number of iterations K. 
Result: Approximate result of matvec y 
^ yo = X 

2 Ao = diag(x) 

3 for k = 1... K do 

4 Afc = W^Xk-iW 

5 d = diagonal(Afc) 

6 Threshold to zero all elements of d, which absolute 
values are less than given threshold r. 

7 yk = yk-i + d 

8 end 

0 y = yK 


The error of the matrix-by-vector product can be esti¬ 
mated by the following theorem. 

Theorem 2. The result of Algorithm\T\ satisfies 


It is easy to get the upper bound on the complexity. The 
number of terms in the SimRank series to get the accuracy 
e can be estimated as log At each step, the diagonal 

of the matrix DW^ has to be computed. Let m be 

an average degree of the vertex. Then the sparsity of is 
bounded by and the evaluation reduces to the evaluation 
of the column norms of the matrix In practice, 

however, this bound is a significant overestimation. 


3.3 Putting it all together 

The GMRES algorithm is summarized in Algorithm[2][?], 
and it is assumed that the matrix-by-vector product is ex¬ 
act. All other operations can be easily implemented in 0{n) 
complexity. 


Algorithm 2: GMRES algorithm for the solution of the 
linear system 


Data: Matrix A, right-hand side b, initial guess xo, 
stopping tolerance e. 

Result: Approximate solution Xm'- IIAa:^ — h|| < e 

1 Start: compute ro = b — Axq, vi = ro/||ro||, Vi = vi, 

P=\\bl 

2 Iterations: 


3 

4 

5 


6 

7 

8 


Orthogonalize: Vk+i = Avk — Vkhk, where hk = Vk Av. 
Normalize: Vk+i = Ufe+i/||ufe+i||. 


Update: I4+i = (14 Vk+i), Hk = 


Hk—i hk 
0 \\vk+i\\ 

where the first column in Hk is omitted when k — 1. 
Solve the least squares problem 


yk = argmiUj^ ||^ei - Hky\\. 

Xm — Xo -j- Vmym- 

Restart: compute ||j"m|| = \\b — 4*^11. Stop if Hr^ll < £■ 
Otherwise: xo = Xm,v\ = rm/||rm|| and start Iterations 
again. 


If the matrix-by-vector products are inexact, the following 
Theorem gives the error bound. 

Theorem 3. /?] After m steps of the inexact GMRES 
procedure, the following estimation for norm of approximate 
and real residues holds: 


lly - y/f ||oo < r-Ttfl- ^ + c^. (8) 

Proof. Suppose that Xk is the result of Algorithm [1] for 
r = 0 after k steps, and 

Xk = Xk + Ek, \\Ek\\c = Sk- 

Then, 


rm - fm\\ < e 


if for any i < m 




a'm{Hm') 


where Om{Hm) is a minimal singular value of the Hessenberg 
matrix corresponding to GMRES process and Ei is an error 
corresponding to matvec on the i-th iteration. 


5k-\-l Si Sk cSk r. 

















4. COMPARISON WITH EXISTING METH¬ 
ODS 

SimRank algorithm computes similarities between vertices 
of the input graph G — {V,E). Here we compute a single¬ 
source SimRank and a one-pair SimRank. The single-source 
SimRank is the vector with SimRank scores between given 
vertex a € V and all other vertices b £ V. The one-pair 
SimRank is the similarity measure s(a, b) between two given 
vertices a and b. 

The proposed method has memory requirement 0{n) and 
computational complexity 0{n) of the pre-computation step. 
Moreover, we compute the sparse approximation to the full 
SimRank matrix. 

For the readers convenience computational and memory 
complexities of the previously proposed methods are pre¬ 
sented Table [T]and[3 

Table 1: Complexities of the single-pair SimRank algorithms 


Paper 

Time 

Memory 

Query 

Precomputation 

[?J 

[?1 

[?1 

This 

0{kN) 

O(r^) 

0{k\E\^) 

Oin) 

0{{N + d)n) 
0{r*n‘^) 

Not needed 
0(n) 

0{nN) 
0{n^r^ -fr"*) 
0{n^) 

Oin) 


every column. The computational cost increases when the 
adjacency matrix becomes more dense, which is natural. 



Figure 1: Dependence of time to solve linear system on n 
for randomly generated graphs with different number of non¬ 
zeros 


Table 2: Complexities of the single-source SimRank algo¬ 
rithms 


Paper 

Time 

Memory 

Query 

Precomputation 

[?1 

©(RfcIS'l) 

OinkiR + PQ)) 

0(m + nP) 

[?] 

0{kd'^) 

Oid'^) 

This 

0(1) 

0(1) 

0(1) 


Some papers like [?], [?], [?], [?] consider a solution of the 
exact Sylvester equation without any estimation as SimRank 
approximation. However, this approach has two fundamen¬ 
tal problems. The first problem is that the exact solution 
of discrete Lyapunov equation is only an approximation to 
the initial SimRank definition, so this solution leads to ad¬ 
ditional errors. We treat this problem by estimation of the 
diagonal item in discrete Lyapunov equation and get the cor¬ 
rect approximate discrete Lyapunov equation for SimRank. 
The second problem is to solve Sylvester equation as pro¬ 
posed in [?] one needs invert adjacency matrix of the graph 
which is unstable and leads to loss of sparsity. Instead of 
invert the matrix of linear operator we use iterative method 
GMRES with fast approximate matvec implementation. 

5. NUMERICAL EXPERIMENTS 
5.1 Synthetic test 

To confirm the 0{n) complexity of the proposed method, 
we generate random adjacency matrices with fixed number 
of nonzero elements in every column and compute SimRank 
for corresponding graphs. The dependence of time on n is 
shown on Figure [T] The other parameters are threshold r = 
10“^, the scale parameter c = 0.6 and number of iteration 
K = 50. Here nnz is a number of non-zero elements in 


The similar plot for dependence of memory to solve the 
linear system on the number of vertices n is presented in 
Figure[2] The other parameters are the same: threshold r = 
10“^, the scale parameter c = 0.6 and number of iteration 
iL = 50. Here nnz is a number of non-zero elements in 
every column. The plot shows that the required memory 
linearly or sub-linearly depends on the number of vertices 
in the graph. But if the adjacency matrix of the graph is 
enough sparse, then the required memory is constant and 
does not depend on the number of vertices. 



Figure 2: Dependence of memory to solve linear system on 
n for randomly generated graphs with different number of 
non-zeros 
































5.2 DIMACSIO collection 

In this section we experimentally study the approximation 
accuracy of our method. The experiments are carried out 
on graphs from DIMACSIO Challenge Collectioini. The list 
of the considered graphs is presented in Tabled 

Table 3: NDCG accuracy for single-source query to the 
graphs from DIMACSIO Challenge Collection 


Name 

n 

nnz 

nnz/n 

1- NDCG@n 

Chesapeake 

39 

340 

8.72 

1.3- lO"'* 

data 

2851 

30186 

10.59 

3.1 • 10"® 

delaunay nlO 

1024 

6112 

5.97 

2- 10“® 

delaunay nil 

2048 

12254 

5.98 

1.2 • 10"® 

delaunay nl2 

4096 

24528 

5.99 

4- 10”® 

delaunay nl3 

8192 

49094 

5.99 

2.5 • 10"'^ 

uk 

4824 

13674 

2.83 

2.74 • 10"^ 

vsp data and 
seymourl 

9167 

111732 

12.19 

10"® 


Table m presents 1—NDCG@n measure for convenience. 
To compute the NDCG@n measure we make q = 100 ran¬ 
dom queries to SimRank and SimRank approximation for 
every graph except Chesapeake graph {q = 39). After that 
we have two vectors s and s of correct SimRank scores and 
approximate SimRank scores between query and all other 
graph vertices. The NDGG measure [?] is defined by the 
following equation: 


1 " 

NDGG = 7^ y 

1 = 1 


2reli _ ^ 

log2(i + l)’ 


where i is an index of concept, according to the sorted ap¬ 
proximate SimRank scores s, reh = Si is the ground-truth 
SimRank score between the query and the i-th vertex, and 
Z is a. normalization constant. The other parameters are 
c = 0.6, K = 50 and r = 10~®. Also nnz is the total num¬ 
ber of non-zero elements in the adjacency matrix, nnzjn is 
the average degree of vertex. 


5.3 Experiment with Wikipedia 

We used Simple English Wikipedia corpus to find semantic 
relatedness between concepts. The undirected graph corre¬ 
sponding to Simple Wikipedia corpus has 150495 vertices 
and 4454023 edges. The direct SimRank matrix computa¬ 
tion of such large graph is infeasible. Therefore, we assess 
the quality of our SimRank approximation method not with 
approximation error but with rationality of the obtained 
similar concepts. We use the following parameters in the 
experiment: c = 0.6, number of iteration k — 10, r = 10~^. 

Table|4]shows some examples of similar concepts extracted 
from Simple English Wikipedia corpus by the proposed Sim¬ 
Rank approximation algorithm. Each column hasthe queried 
concept in the top and the most similar concepts to the 
queried one in the other rows. Every column is sorted ac¬ 
cording to SimRank scores given by SimRank matrix ap¬ 
proximation. We do not display these scores because of the 
space limitation: the scores differ in 4-th or 5-th significant 
figures. 



Figure 3: Dependence NDGG on the thresholds for graph 
from DIMACSIO collection 



Figure 4: Dependence of time to solve linear system on the 
thresholds for graph from DIMACSIO collection 


An important research direction is the study of hyper¬ 
graphs, when the adjacency matrix is replaced by the adja¬ 
cency tensor. We plan to investigate this issue. Also, the 
computation of the SimRank by summing of the Neumann 
series can be improved by using more advanced iterative 
method, like the IGMRES approach considered here, but it 
requires a lot of technical work. 
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Figure 5: Convergence of the GMRES with different thresh¬ 
olds T 


Table 4: Similar concepts according to proposed SimRank 
approximation algorithm 


GNU 

Earth 

Liquid 

Richard Matthew 

Stallman 

South Pole- 
Aitken basin 

Plasma (matter) 

Linux operating sys¬ 
tem 

Frame of 

referance 

Matters 

Hurd 

Interplanetary 

internet 

Particle theory of 
matter 

Debian linux 

Supernova 

1987A 

Hematological 

Linux (kernel) 

Probotector 

Elude 

*nix 

Near Earth 
Object 

Human blood 
















